Towards Sustainable Precision: Machine Learning for Laser Micromachining Optimization
Correas-Naranjo, Luis, Camacho-Sánchez, Miguel, Launet, Laëtitia, Zuric, Milena, Naranjo, Valery
In the pursuit of sustainable manufacturing, ultra-short pulse laser micromachining stands out as a promising solution, offering high-precision, high-quality laser processing. However, unlocking the full potential of ultra-short pulse lasers requires an optimized monitoring system capable of early detection of defective workpieces, regardless of the preprocessing technique employed. While advances in machine learning can help predict process quality features, the complexity of monitoring data necessitates reducing both model size and data dimensionality to enable real-time analysis. To address these challenges, this paper introduces a machine learning framework designed to enhance surface quality assessment across diverse preprocessing techniques. To facilitate real-time laser processing monitoring, our solution aims to optimize the computational requirements of the machine learning model. Experimental results show that the proposed model not only surpasses the generalizability achieved by previous works across diverse preprocessing techniques but also significantly reduces the computational requirements for training. Through these advancements, we aim to establish the baseline for a more sustainable manufacturing process.
- North America > United States (0.04)
- Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
- Europe > Germany > North Rhine-Westphalia > Cologne Region > Aachen (0.04)
ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery
Sivaraman, Anush Lakshman, Adu-Gyamfi, Kojo, Shihab, Ibne Farabi, Sharma, Anuj
Adverse weather conditions challenge safe transportation, necessitating robust real-time weather detection from traffic camera imagery. We propose a novel framework combining CycleGAN-based domain adaptation with efficient contrastive learning to enhance weather classification, particularly in low-light nighttime conditions. Our approach leverages the lightweight SigLIP-2 model, which employs pairwise sigmoid loss to reduce computational demands, integrated with CycleGAN to transform nighttime images into day-like representations while preserving weather cues. Evaluated on an Iowa Department of Transportation dataset, the baseline EVA-02 model with CLIP achieves a per-class overall accuracy of 96.55\% across three weather conditions (No Precipitation, Rain, Snow) and a day/night overall accuracy of 96.55\%, but shows a significant day-night gap (97.21\% day vs.\ 63.40\% night). With CycleGAN, EVA-02 improves to 97.01\% per-class accuracy and 96.85\% day/night accuracy, boosting nighttime performance to 82.45\%. Our Vision-SigLIP-2 + Text-SigLIP-2 + CycleGAN + Contrastive configuration excels in nighttime scenarios, achieving the highest nighttime accuracy of 85.90\%, with 94.00\% per-class accuracy and 93.35\% day/night accuracy. This model reduces training time by 89\% (from 6 hours to 40 minutes) and inference time by 80\% (from 15 seconds to 3 seconds) compared to EVA-02. By narrowing the day-night performance gap from 33.81 to 8.90 percentage points, our framework provides a scalable, efficient solution for all-weather classification using existing camera infrastructure.
- Government (0.75)
- Transportation (0.55)
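The pairwise sigmoid objective that the abstract credits for SigLIP-2's efficiency treats every image-text pair in a batch as an independent binary classification, avoiding CLIP's batch-wide softmax. A minimal NumPy sketch of that loss follows; the temperature and bias defaults are illustrative assumptions, not values from this paper:

```python
import numpy as np

def siglip_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sketch of the SigLIP-style pairwise sigmoid loss.

    img_emb, txt_emb: L2-normalized (N, D) arrays of matched pairs
    (row i of each is a positive pair; all cross rows are negatives).
    t (temperature) and b (bias) are illustrative defaults.
    """
    logits = t * img_emb @ txt_emb.T + b        # (N, N) similarity logits
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0              # +1 on diagonal, -1 elsewhere
    # -log sigmoid(z * logit) = log(1 + exp(-z * logit)), summed over all
    # N^2 pairs and normalized by batch size N.
    return np.log1p(np.exp(-labels * logits)).sum() / n
```

Because each pair contributes an independent sigmoid term, the loss needs no global normalization across the batch, which is the source of the reduced computational demand mentioned above.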
Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models
Pawar, Pranav, Shah, Kavish, Bhalani, Akshat, Kasat, Komal, Mittal, Dev, Gala, Hadi, Patil, Deepali, Raichada, Nikita, Deshmukh, Monali
In recent years, VLMs have captured the imagination of the Artificial Intelligence (AI) community, demonstrating an impressive ability to interpret, reason about, and generate content spanning both text and images. From answering questions about visual scenes to engaging in multi-modal dialogue, models such as Flamingo [1], PaLI [25], and BLIP-2 [14] are redefining the frontier of vision intelligence. Yet, as these models widen their application capabilities, a fundamental question emerges: can they truly reason, or are they sophisticated pattern matchers? To explore this question, we turn to the domain of physics, a field that serves as a universal benchmark for human logical reasoning. Physics problems are an ideal testbed for VLMs, as they are multi-modal, combining textual descriptions, mathematical equations, and often clarifying diagrams. A model that can successfully solve these problems must not only understand language and images but also grasp the underlying relationships and principles that govern the physical realm. The challenge, until now, has been the lack of accessible tools for this kind of evaluation. Existing benchmarks for scientific reasoning, such as ARC [7] and ScienceQA [17], are often limited to basic text-only question sets, while those that incorporate visual elements, like MathVista [18], frequently depend on complex physics simulators that are computationally expensive for many researchers to deploy, thereby restricting reproducibility.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.89)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)
rECGnition_v2.0: Self-Attentive Canonical Fusion of ECG and Patient Data using deep learning for effective Cardiac Diagnostics
Srivastava, Shreya, Kumar, Durgesh, Jiwari, Ram, Seth, Sandeep, Sharma, Deepak
The variability in ECG readings caused by individual patient characteristics has posed a considerable challenge to adopting automated ECG analysis in clinical settings. To address this, a novel feature fusion technique termed SACC (Self-Attentive Canonical Correlation) was proposed. This technique is combined with a DPN (Dual Pathway Network) and depth-wise separable convolution to create a robust, interpretable, and fast end-to-end arrhythmia classification model named rECGnition_v2.0 (robust ECG abnormality detection). This study uses the MIT-BIH, INCARTDB, and EDB datasets to evaluate the efficiency of rECGnition_v2.0 across various classes of arrhythmias. To investigate the influence of the constituent model components, various ablation studies were performed: simple concatenation, CCA, and the proposed SACC were compared, and the importance of global and local ECG features was tested within the DPN of rECGnition_v2.0. The model was also benchmarked against state-of-the-art CNN models on overall accuracy versus model parameters, FLOPs, memory requirements, and prediction time. Furthermore, the inner workings of the model were interpreted by comparing the activation locations in the ECG before and after the SACC layer. On the MIT-BIH Arrhythmia dataset, rECGnition_v2.0 achieved a remarkable accuracy of 98.07% and an F1-score of 98.05% for classifying ten distinct classes of arrhythmia with just 82.7M FLOPs per sample, surpassing the performance metrics of current state-of-the-art (SOTA) models. Similarly, on the INCARTDB and EDB datasets, excellent F1-scores of 98.01% and 96.21%, respectively, were achieved for AAMI classification. The compact architectural footprint of rECGnition_v2.0, characterized by fewer trainable parameters and diminished computational demands, offers several advantages, including interpretability and scalability.
- Asia > India > Uttarakhand > Roorkee (0.05)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (2 more...)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
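The compact footprint attributed above to depth-wise separable convolution follows from a simple parameter count: a standard k×k convolution couples every input channel to every output channel, while the separable variant splits this into a per-channel k×k depthwise pass plus a 1×1 pointwise mix. The sketch below illustrates this generic property of the layer type; the channel counts are illustrative, not figures from the paper:

```python
def conv_params(k, c_in, c_out):
    """Parameter count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise separable: one k x k filter per input channel (depthwise)
    followed by a 1 x 1 pointwise convolution mixing channels."""
    return k * k * c_in + c_in * c_out

# Example: a 3x3 layer with 64 input and 128 output channels.
standard = conv_params(3, 64, 128)                # 73,728 parameters
separable = depthwise_separable_params(3, 64, 128)  # 8,768 parameters
```

For this configuration the split yields roughly an 8x reduction in parameters (and a similar reduction in FLOPs), which is how such models stay small without shrinking channel width.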
Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7\% of the 11B model's performance (50.4\% vs 52.1\%) while using it for only 43\% of tokens, decreasing computational cost by 41.5\%. These gains become more pronounced with larger size differentials in the Qwen family, where we achieve 92.9\% of the 14B model's performance (74.3\% vs 80.0\%) while using it for just 25\% of tokens, decreasing computational cost by 67\%. The consistency of these results across model pairs suggests that language model computation can be significantly optimized by selectively deploying model capacity based on local generation complexity. Our findings indicate that current approaches to model inference may be unnecessarily conservative in their pursuit of perfect output fidelity, and that accepting minor performance trade-offs can enable dramatic reductions in computational costs.
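The switching rule described above, monitoring rolling entropy and escalating to the larger model only when it crosses a threshold, can be sketched in a few lines. This is a minimal illustration under assumed parameters (window size, threshold, model labels), not the paper's implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_model(recent_entropies, threshold, window=4):
    """Return which model should generate the next token.

    The rolling mean of the last `window` per-token entropies
    approximates local prediction uncertainty: the small model
    suffices in low-entropy regions, and the large model is used
    only when the mean exceeds `threshold`. Both `window` and
    `threshold` here are illustrative assumptions.
    """
    recent = recent_entropies[-window:]
    rolling = sum(recent) / len(recent)
    return "large" if rolling > threshold else "small"
```

In a full decoding loop, `recent_entropies` would be fed from the small model's logit distributions at each step, so escalating to the large model costs an extra forward pass only when the rule actually triggers.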
GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction
Kobzarev, Oleg, Lykov, Artem, Tsetserukou, Dzmitry
This paper introduces GestLLM, an advanced system for human-robot interaction that enables intuitive robot control through hand gestures. Unlike conventional systems, which rely on a limited set of predefined gestures, GestLLM leverages large language models and feature extraction via MediaPipe to interpret a diverse range of gestures. This integration addresses key limitations in existing systems, such as restricted gesture flexibility and the inability to recognize complex or unconventional gestures commonly used in human communication. By combining state-of-the-art feature extraction and language model capabilities, GestLLM achieves performance comparable to leading vision-language models while supporting gestures underrepresented in traditional datasets. For example, this includes gestures from popular culture, such as the ``Vulcan salute'' from Star Trek, without additional pretraining or prompt engineering. This flexibility enhances the naturalness and inclusivity of robot control, making interactions more intuitive and user-friendly. GestLLM provides a significant step forward in gesture-based interaction, enabling robots to understand and respond to a wide variety of hand gestures effectively. This paper outlines its design, implementation, and evaluation, demonstrating its potential applications in advanced human-robot collaboration, assistive robotics, and interactive entertainment.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
Wang, Fali, Zhang, Zhiwei, Zhang, Xianren, Wu, Zongyu, Mo, Tzuhao, Lu, Qiuhao, Wang, Wanjing, Li, Rui, Xu, Junjie, Tang, Xianfeng, He, Qi, Ma, Yao, Huang, Ming, Wang, Suhang
Large language models (LLMs) have demonstrated emergent abilities in text generation, question answering, and reasoning, facilitating various tasks and domains. Despite this proficiency, LLMs like PaLM 540B and Llama-3.1 405B face limitations due to large parameter sizes and computational demands, often requiring cloud API use, which raises privacy concerns, limits real-time applications on edge devices, and increases fine-tuning costs. Additionally, LLMs often underperform in specialized domains such as healthcare and law due to insufficient domain-specific knowledge, necessitating specialized models. Therefore, Small Language Models (SLMs) are increasingly favored for their low inference latency, cost-effectiveness, efficient development, and easy customization and adaptability. These models are particularly well-suited for resource-limited environments and domain knowledge acquisition, addressing LLMs' challenges and proving ideal for applications that require localized data handling for privacy, minimal inference latency for efficiency, and domain knowledge acquisition through lightweight fine-tuning. The rising demand for SLMs has spurred extensive research and development. However, a comprehensive survey investigating issues related to the definition, acquisition, application, enhancement, and reliability of SLMs remains lacking, prompting us to conduct a detailed survey on these topics. The definition of SLMs varies widely; thus, to standardize it, we propose defining SLMs by their capability to perform specialized tasks and suitability for resource-constrained settings, setting boundaries based on the minimal size for emergent abilities and the maximum size sustainable under resource constraints. For other aspects, we provide a taxonomy of relevant models/methods and develop general frameworks for each category to enhance and utilize SLMs effectively.
- North America > United States (1.00)
- Asia (1.00)
- Europe (0.92)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Experimental Study (0.67)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- (2 more...)
A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks
Liang, Chia Xin, Tian, Pu, Yin, Caitlyn Heqi, Yua, Yao, An-Hou, Wei, Ming, Li, Wang, Tianyang, Bi, Ziqian, Liu, Ming
This survey and application guide to multimodal large language models (MLLMs) explores the rapidly developing field of MLLMs, examining their architectures, applications, and impact on AI and generative models. Starting with foundational concepts, we delve into how MLLMs integrate various data types, including text, images, video, and audio, to enable complex AI systems for cross-modal understanding and generation. The guide covers essential topics such as training methods, architectural components, and practical applications in various fields, from visual storytelling to enhanced accessibility. Through detailed case studies and technical analysis, it examines prominent MLLM implementations while addressing key challenges in scalability, robustness, and cross-modal learning. Concluding with a discussion of ethical considerations, responsible AI development, and future directions, this authoritative resource provides both theoretical frameworks and practical insights. It offers a balanced perspective on the opportunities and challenges in the development and deployment of MLLMs, and is highly valuable for researchers, practitioners, and students interested in the intersection of natural language processing and computer vision.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Indiana (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (5 more...)
- Workflow (1.00)
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- (2 more...)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Services (1.00)
- (9 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- (8 more...)